A Comparative Study of Topic Identification on Newspaper and E-mail

Authors

  • Brigitte Bigi
  • Armelle Brun
  • Jean Paul Haton
  • Kamel Smaïli
  • Imed Zitouni
Abstract

This paper presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. The study proved fruitful for our research: it shows that statistical topic identification methods depend not only on the corpus but also on its type. One of the methods achieves a topic identification rate of 80% on a general newspaper corpus but does not exceed 30% on the e-mail corpus. Another method gives the best result on e-mails but does not behave the same way on the newspaper corpus. We also show that almost all our methods achieve good results in retrieving the first two manually annotated labels.
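
As an illustration of the first method named above, the following is a minimal sketch of topic identification with topic unigrams. It is not the authors' implementation: the add-one smoothing, the function names `train_topic_unigrams` and `identify_topic`, and the toy training data are all assumptions made for this example. Each topic gets its own unigram distribution, and a document is assigned to the topic under which its words are most probable.

```python
import math
from collections import Counter

def train_topic_unigrams(labeled_docs):
    """Count word occurrences per topic.

    labeled_docs: iterable of (topic, list_of_tokens) pairs.
    Returns (per-topic Counters, shared vocabulary).
    """
    counts = {}
    for topic, tokens in labeled_docs:
        counts.setdefault(topic, Counter()).update(tokens)
    vocab = set()
    for topic_counts in counts.values():
        vocab.update(topic_counts)
    return counts, vocab

def identify_topic(tokens, counts, vocab):
    """Return the topic whose smoothed unigram model gives the
    highest log-likelihood to the token sequence."""
    # Assumption: add-one smoothing, with one extra slot for unseen words.
    v = len(vocab) + 1
    best_topic, best_score = None, -math.inf
    for topic, topic_counts in counts.items():
        total = sum(topic_counts.values())
        score = sum(
            math.log((topic_counts[w] + 1) / (total + v)) for w in tokens
        )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical toy data, for illustration only.
training = [
    ("sports", ["match", "goal", "team", "goal"]),
    ("finance", ["market", "stock", "bank", "stock"]),
]
counts, vocab = train_topic_unigrams(training)
predicted = identify_topic(["goal", "team"], counts, vocab)
```

A model of this kind relies entirely on the word distribution of its training corpus, which is one plausible reason results can differ sharply between newspaper text and e-mail.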

Publication date: 2001